CLAIMS 

1. A controller fault recovery system for an array storage system for storing data objects including at least one parity group having a number N of data blocks and a parity block computed from the N data blocks comprising:
    an array of storage devices;
    at least N+1 controllers, each controller operably connected to a unique portion of the array of storage devices; and
    a distributed file system having at least one input/output manager (IOM) routine for each controller, each IOM routine including:
        means for controlling access to the unique portion of the array of storage devices associated with that controller;
        means for maintaining a journal reflecting a state of all requests and commands received and issued for that IOM routine; and
        means for reviewing the journal and the state of all requests and commands received and issued for that IOM routine in response to a notification that at least one of the IOM routines has experienced an unscheduled stop and publishing any unfinished request and commands for the at least one failed IOM routine.
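
Illustrative sketch (not part of the claims): the journal review and publication recited in claim 1 might look roughly like the following. The names `Journal`, `JournalEntry`, `State`, `op_id` and `peer` are hypothetical, and the claim does not specify how journal entries are represented; this only shows one way a surviving IOM could find operations involving a failed IOM that were issued but never completed.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ISSUED = "issued"
    COMPLETED = "completed"


@dataclass
class JournalEntry:
    op_id: int      # hypothetical identifier for a request or command
    peer: str       # hypothetical name of the IOM on the other end
    state: State


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def record(self, op_id, peer, state):
        """Append one journal entry; later entries supersede earlier ones."""
        self.entries.append(JournalEntry(op_id, peer, state))

    def unfinished_for(self, failed_iom):
        """Return ids of operations with failed_iom that never completed."""
        latest = {}
        for entry in self.entries:
            latest[entry.op_id] = entry
        return sorted(
            op_id
            for op_id, entry in latest.items()
            if entry.peer == failed_iom and entry.state is State.ISSUED
        )


journal = Journal()
journal.record(1, "iom-2", State.ISSUED)
journal.record(1, "iom-2", State.COMPLETED)
journal.record(2, "iom-2", State.ISSUED)      # never completed
journal.record(3, "iom-3", State.ISSUED)      # different peer
published = journal.unfinished_for("iom-2")   # what this IOM would publish
```

Here only operation 2 is reported, since operation 1 completed and operation 3 involves a different IOM.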

2. The system of claim 1 wherein if the notification is an external notification that a single IOM has failed, the distributed file system keeps running and the fault recovery system uses a proxy arrangement to recover, and wherein if the notification is a notification that all of the IOMs experienced an unscheduled stop because more than one IOM has failed, the distributed file system is restarted and the notification occurs internally as part of the restart procedure for each IOM.

3. The system of claim 2 wherein the proxy arrangement assigns a proxy IOM routine for each IOM routine, the proxy IOM routine being an IOM routine different than the assigned IOM routine that includes:
    means for monitoring the assigned IOM routine for a failure and, in the event of a failure of only the assigned IOM routine, issuing a notification to all other IOM routines;
    means for receiving from all other IOM routines identifications of any unfinished requests or commands for the assigned IOM routine that has failed; and
    means for marking in a meta-data block for the assigned IOM routine a state of any data blocks, parity blocks or meta-data blocks associated with the unfinished requests or commands reflecting actions needed for each such block when the assigned IOM routine recovers.

4. The system of claim 2 wherein in the event of a failure of more than one of the IOM routines the distributed file system performs an unscheduled stop of all IOM routines and, upon recovery of at least N of the IOM routines after the unscheduled stop, each IOM routine reviews the journal and state for that IOM routine and the publication of any unfinished requests or commands for that IOM routine from all of the other IOM routines and reconstructs each data block, parity block or metadata block in response, so as to insure that any updates to a block of data and its block of parity are atomic.



5. A computer-implemented method of storing data objects in an array storage system, the data objects including at least one parity group having a number N of data blocks and a parity block computed from the N data blocks, wherein the data objects are stored in the array storage system under software control of a distributed file system having at least a number N+1 of input/output manager (IOM) routines, each IOM routine controlling access to a unique portion of the array storage system and having a plurality of buffers to temporarily store blocks to be transferred in/out of that portion of the array storage system, the method comprising:
    (a) receiving a write request at a first IOM to store a new data block and, in response:
        (a1) issuing a read command to read an old data block corresponding to the new data block if the old data block is not already in a first buffer in the first IOM, the old data block having a first location in a meta-data structure for the distributed file system that contains an old disk address for the old data block;
        (a2) allocating a new disk address for the new data block;
        (a3) transferring the new data block into a second buffer in the first IOM;
        (a4) issuing a write command to write the new data block from the second buffer at the new disk address;
        (a5) making a journal entry that the write command was issued;
        (a6) sending an update parity request to a second IOM associated with a parity block of the parity group that includes the old data block;
        (a7) determining changes between the old data block and the new data block;
        (a8) sending the changes between the old data block and the new data block to the second IOM in response to a request from the second IOM;
        (a9) releasing the first buffer in response to a confirmation from the second IOM that the changes between the old data block and the new data block have been received;
        (a10) receiving a response to the write command that the new data block has been written;
        (a11) changing the old disk address to the new disk address in the first location in the meta-data structure;
        (a12) releasing the second buffer;
        (a13) sending a message to the second IOM that the write command was completed; and
        (a14) deallocating space reserved for the old data block in the meta-data structure and making a journal entry that the write command was completed in response to receiving a message from the second IOM that the parity update was completed; and
    (b) receiving the update parity request at the second IOM and, in response:
        (b1) making a journal entry that the update parity request was received;
        (b2) sending a request to the first IOM for the changes between the old data block and the new data block;
        (b3) issuing a read command to read an old parity block corresponding to the parity block of the parity group that includes the old data block if the parity block is not already in a first buffer in the second IOM, the old parity block having a second location in the meta-data structure for the distributed file system that contains an old disk address for the old parity block;
        (b4) receiving in a second buffer in the second IOM the changes between the old data block and the new data block from the first IOM and sending the confirmation that the changes between the old data block and the new data block have been received;
        (b5) allocating a new disk address for a new parity block and a third buffer in the second IOM;
        (b6) generating the new parity block in the third buffer based on the changes between the old data block and the new data block and the old parity block;
        (b7) releasing the first buffer and the second buffer;
        (b8) issuing a write command to write the new parity block in the third buffer to the new disk address for the new parity block;
        (b9) receiving a response to the write command that the new parity block has been written;
        (b10) changing the old disk address for the old parity block to the new disk address for the new parity block in the second location in the meta-data structure;
        (b11) making a journal entry that the update parity request was completed;
        (b12) sending a message to the first IOM that the update parity request was completed; and
        (b13) deallocating space reserved for the old parity block in the meta-data structure in response to receiving a message from the first IOM that the write command was completed.
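
Illustrative sketch (not part of the claims): steps (a2), (a4), (a11) and (a14) write the new block to a freshly allocated address and only later release the old one, so the old block stays readable until the parity side confirms. The following minimal model shows that ordering. `MetaData`, `disk` and the block id `"blk"` are hypothetical stand-ins; buffers, journaling and inter-IOM messages are omitted.

```python
class MetaData:
    """Toy meta-data structure: maps a block id to its current disk address."""

    def __init__(self):
        self.location = {}       # block id -> current disk address
        self.allocated = set()   # addresses currently reserved
        self._next = 0

    def allocate(self):
        addr = self._next
        self._next += 1
        self.allocated.add(addr)
        return addr

    def deallocate(self, addr):
        self.allocated.discard(addr)


disk = {}                        # hypothetical disk: address -> bytes
meta = MetaData()

# Existing state: an old data block is already stored.
old_addr = meta.allocate()
disk[old_addr] = b"old"
meta.location["blk"] = old_addr

new_addr = meta.allocate()           # (a2) allocate a new disk address
disk[new_addr] = b"new"              # (a4)/(a10) write issued and confirmed
meta.location["blk"] = new_addr      # (a11) switch the address in the meta-data

# Until the parity IOM reports completion, the old block is still on disk
# and could be used for recovery.
still_recoverable = disk[old_addr]

meta.deallocate(old_addr)            # (a14) after the parity-complete message
```

Because the old address is never overwritten in place, an interruption at any point before (a14) leaves both versions of the block available.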

6. The method of claim 5 wherein the write request to store the new data block includes a toughness value that is an indication of when a write request complete response should be returned based on a user chosen value of the desired balance between response time and reliability and the method further comprises:
    if the toughness value requires the new data block and the new parity block to be confirmed as written to the redundant storage array, returning a write request complete response after step (a13); and
    if the toughness value requires atomicity of the new data block and the new parity block but does not require the new data block and the new parity block to be confirmed as written to the redundant storage array, returning a write request complete response after step (a8).

7. The method of claim 6 wherein the method further comprises:
    if the toughness value does not require atomicity of the new data block and the new parity block, returning a write request complete response after step (a4).
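
Illustrative sketch (not part of the claims): claims 6 and 7 tie the point at which the write request complete response is returned to the toughness value. One way to picture this is a lookup from toughness level to the step after which the response is sent. The enum names and numeric values below are assumptions; the claims do not define an encoding for the toughness value.

```python
from enum import Enum


class Toughness(Enum):
    # Hypothetical encoding; the claims only describe three behaviours.
    NO_ATOMICITY = 0   # claim 7: atomicity not required
    ATOMIC = 1         # claim 6: atomicity required, confirmation not required
    CONFIRMED = 2      # claim 6: data and parity confirmed as written


# Step of claim 5 after which the write request complete response is returned.
ACK_AFTER_STEP = {
    Toughness.NO_ATOMICITY: "a4",
    Toughness.ATOMIC: "a8",
    Toughness.CONFIRMED: "a13",
}


def should_acknowledge(toughness, step_just_finished):
    """Return True if the response is due right after the given step."""
    return ACK_AFTER_STEP[toughness] == step_just_finished
```

A lower toughness value returns earlier in the sequence, trading reliability for response time, which is the balance claim 6 describes.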

8. The method of claim 5 wherein step (a7) is accomplished by comparing the contents of the first buffer and the second buffer in the first IOM and generating a delta block that is stored in a third buffer in the first IOM and wherein step (b4) stores the delta block in the second buffer in the second IOM and step (b5) compares the old parity block in the first buffer in the parity IOM with the delta block in the second buffer of the second IOM and generates the new parity block that is stored in the third buffer in the second IOM.
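
Illustrative sketch (not part of the claims): claim 8 does not name the comparison operation. Assuming the parity is a bytewise XOR of the data blocks, as is common for single-parity arrays, the delta block and the new parity block could be computed as follows. `xor_blocks` and the sample block contents are hypothetical.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


# A parity group with N = 2 data blocks (assumed XOR parity).
old_data = bytes([0x0F, 0xF0, 0xAA])
other_data = bytes([0x11, 0x22, 0x33])
old_parity = xor_blocks(old_data, other_data)

new_data = bytes([0xFF, 0xF0, 0x55])

# Data IOM: compare old and new data to produce the delta block.
delta = xor_blocks(old_data, new_data)

# Parity IOM: combine the old parity with the delta to get the new parity.
new_parity = xor_blocks(old_parity, delta)
```

Under the XOR assumption the parity IOM never needs the other data blocks of the group; the old parity and the delta are enough.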



9. The method of claim 5 wherein the array storage system is not provided with non-volatile random access memory for the journal and wherein at least steps (a5), (a14), (b1) and (b11) further comprise the step of receiving a response that the journal has been flushed to the array storage system.

10. The method of claim 5 wherein at least steps (a11), (a14), (b10) and (b13) further comprise the step of receiving a response that the metadata has been flushed to the array storage system.

11. The method of claim 5 wherein the array storage system is provided with non-volatile random access memory (NVRAM) and the journal is recorded in the NVRAM.

12. The method of claim 5 wherein the method further comprises:
    (c) in the event of an unscheduled stop of the array storage system, recovering the data parity group of a data object by reviewing the journal entries for both the first IOM and the second IOM and reconstructing the data block or the parity block in response if necessary.

13. The method of claim 12 wherein the first IOM is a data IOM for a given parity group and the second IOM is a parity IOM for the parity group and wherein step (c) comprises:
    (c1) making no changes to the data parity group or the location in the meta-data structure of the old data disk address and the old parity disk address if:
        (c1a) no journal entry exists for either the data IOM or the parity IOM for the data parity group;
        (c1b) a journal entry exists that the write command was issued but not completed and there is no journal entry that the parity update request was issued;
        (c1c) a journal entry exists that the write command was issued but not completed and a journal entry exists that the parity update request was issued but not completed; or
        (c1d) a journal entry exists that the write command was completed and a journal entry exists that the parity update request was completed;
    (c2) reconstructing the new parity block from the old data block and the new data block if a journal entry exists for the data IOM that the write command was completed and no journal entry exists for the parity IOM that the update parity request was completed; and
    (c3) reconstructing changes to the parity from the old parity block and the new parity block and then reconstructing the new data block from the old data block and the changes to the parity if a journal entry exists for the data IOM that the write command was issued but not completed and a journal entry exists for the parity IOM that the update parity request was completed.
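
Illustrative sketch (not part of the claims): the cases of claim 13 can be read as a decision table keyed on how far the data write and the parity update progressed according to the two journals. The enums `Progress` and `Action` and the function `recovery_action` are hypothetical. Combinations the claim does not enumerate (a parity entry with no data entry) are reported separately rather than guessed.

```python
from enum import Enum


class Progress(Enum):
    NONE = "no journal entry"
    STARTED = "issued or received, not completed"
    COMPLETED = "completed"


class Action(Enum):
    NO_CHANGE = "c1"
    REBUILD_PARITY = "c2"
    REBUILD_DATA = "c3"
    NOT_ENUMERATED = "combination not listed in the claim"


def recovery_action(write: Progress, parity: Progress) -> Action:
    """Map journal progress for one parity group to a recovery action."""
    if write is Progress.COMPLETED:
        # (c1d) both sides finished; otherwise (c2) parity is behind the data.
        if parity is Progress.COMPLETED:
            return Action.NO_CHANGE
        return Action.REBUILD_PARITY
    if write is Progress.STARTED:
        # (c3) parity finished ahead of the data; (c1b)/(c1c) neither finished.
        if parity is Progress.COMPLETED:
            return Action.REBUILD_DATA
        return Action.NO_CHANGE
    # No journal entry on the data side.
    if parity is Progress.NONE:
        return Action.NO_CHANGE          # (c1a)
    return Action.NOT_ENUMERATED
```

The table makes the intent visible: either both the data and parity updates take effect or neither does, so the parity group stays consistent.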

14. The method of claim 13 wherein the journal entry of at least steps (a5), (a14), (b1) and (b11) includes the corresponding new disk address and old disk address for the data or parity, respectively, and wherein step (c) further comprises the step of determining whether the new disk address or the old disk address matches a disk address in the meta-data structure for the data or parity, respectively.

15. A computer-implemented method of storing data objects in an array storage system, the data objects including at least one parity group having a number N of data blocks and a parity block computed from the N data blocks, wherein the data objects are stored in the array storage system under software control of a distributed file system having at least a number N+1 of input/output manager (IOM) routines, each IOM routine controlling access to a unique portion of the array storage system, the method comprising:
    at a first IOM:
        receiving a write request to store a new block of data for a data object;
        issuing an update parity request to a second IOM associated with a parity block corresponding to the new block of data;
        issuing a write command to write the new block of data to the storage system; and
        receiving a write command complete from the storage system for the new block of data;
    at the second IOM:
        receiving the update parity request;
        computing a new block of parity for the parity group that includes the new block of data; and
        issuing a write command to write the new block of parity; and
        receiving a write command complete from the storage system for the new block of parity;
    for each of the first and second IOM, maintaining a journal of all requests and commands received and issued; and
    in the event of an unscheduled stop of the array storage system, recovering the data parity group of the data object by reviewing the journal entries for both the first and second IOM and reconstructing the data block or the parity block in response if necessary.

15. The method of claim 15 wherein each IOM makes a journal entry in response to:
    a write data command from a requestor;
    an update parity request from another IOM;
    a write data command complete from the array storage system;
    a write parity command issued in response to the update parity request from another IOM;
    a write parity command complete from the array storage system; and
    an update parity request complete from another IOM.

16. The method of claim 15 wherein each IOM maintains its own journal.

17. A computer-implemented method of storing data objects in an array storage system, the data objects including at least one parity group having a number N of data blocks and a parity block computed from the N data blocks, wherein the data objects are stored in the array storage system under software control of a distributed file system having at least a number N+1 of input/output manager (IOM) routines, each IOM routine controlling access to a unique portion of the storage system, the method comprising:
    at a first IOM:
        receiving a write request to store a new block of data for a data object;
        issuing an update parity request to a second IOM associated with a parity block corresponding to the new block of data;
        issuing a write command to write the new block of data to the storage system; and
        receiving a write command complete from the storage system for the new block of data;
    at the second IOM:
        receiving the update parity request;
        computing a new block of parity for the parity group that includes the new block of data; and
        issuing a write command to write the new block of parity; and
        receiving a write command complete from the storage system for the new block of parity;
    at a third IOM that is designated as a proxy for the first IOM:
        monitoring requests to the first IOM; and
        in the event that the first IOM does not respond to a request, assuming responsibility for responding to the request;
    for each of the first and second IOMs, maintaining a journal of all requests and commands received and issued; and
    in the event that the first IOM does not respond to a request, recovering the data parity group of the data object by reviewing the journal entries for both the first and second IOM and reconstructing the data block from the parity block.
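
Illustrative sketch (not part of the claims): the final step of claim 17 reconstructs a data block from the parity block. Assuming XOR parity over the group (the claim does not name the parity function), the block held by the unresponsive IOM can be rebuilt from the parity block and the surviving data blocks. `xor_blocks`, `reconstruct_missing` and the sample blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct_missing(surviving_blocks, parity: bytes) -> bytes:
    """Rebuild the one missing data block of an XOR parity group."""
    return reduce(xor_blocks, surviving_blocks, parity)


# Parity group with N = 3 data blocks; the second block is on the failed IOM.
blocks = [b"\x01\x02", b"\x10\x20", b"\xa0\x0b"]
parity = reduce(xor_blocks, blocks)

rebuilt = reconstruct_missing([blocks[0], blocks[2]], parity)
```

This works for exactly one missing block per parity group, which matches the single-IOM failure case that the proxy arrangement handles.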

18. An array storage system for storing data objects including at least one parity group having a number N of data blocks and a parity block computed from the N data blocks comprising:
    an array of storage devices; and
    a distributed file system having at least a number N+1 of input/output manager (IOM) routines, each IOM routine controlling access to a unique portion of the array of storage devices and maintaining a journal of all requests and commands received and issued for that IOM wherein
        a first IOM responds to a write request to store a new block of data for a data object and issues an update parity request to a second IOM associated with a parity block corresponding to the block of data;
        the second IOM computes a new block of parity for the parity group that includes the new block of data; and
        in the event of an unscheduled stop of a portion of the array of storage devices controlled by either the first IOM or second IOM, the distributed file system reviews the journal for both the first IOM and second IOM once the first and second IOM are both restarted and reconstructs the data block or the parity block in response so as to insure that any updates to the new block of data or new block of parity are atomic.



19. The system of claim 18 wherein the distributed file system further includes a third IOM designated as a proxy to the first IOM and wherein, in the event that the first IOM does not respond to a request and another IOM publishes that the first IOM is not available, the third IOM assumes responsibility for responding to the request and marks a meta-data file associated with the first IOM that any write requests for data blocks and parity blocks to the first IOM that were handled by the third IOM need to be reconstructed once the first IOM is restarted.

20. The system of claim 19 wherein the third IOM assumes responsibility for the first by starting a new IOM routine on the controller for the third IOM where the new IOM routine acts as a proxy IOM for the first IOM.

21. The system of claim 18 wherein each of the IOMs other than the first IOM respond to the publication that the first IOM is not available by reviewing the journal and state for that IOM and publishing any unfinished requests or commands between that IOM and the first IOM.
